Abstract. Parser generators generate translators from language specifications. In 
many cases, such specifications contain semantic actions written in the same language as 
the generated code. Since these actions are subject to little static checking, they are usually 
a source of errors which are discovered only when generated code is compiled. 

In this paper we propose a parser generator front-end which statically checks seman- 
tic actions for typing errors and prevents such errors from appearing in generated code. 
The type checking procedure is extensible to support many implementation languages. 
An extension for Java is presented along with an extension for declarative type system 
descriptions. 
1. Introduction. Parser generators have been used for decades to create
translators and other language processing tools. A parser generator is a tool that
reads a specification of a language and generates code which is capable of
checking its input text for syntactic correctness and constructing an internal
representation from it. This process involves several languages (see Figure 1):
the specification is written in a specification language and describes the parsed
language, while the generated code (which recognizes the parsed language) is
written in an implementation language (e.g., Java or C).

A specification language usually includes a notation for context-free
grammars and some means to specify semantic actions, which are the computations
translating a program into the internal representation. In this paper we consider
on-line parser generators such as Yacc (Johnson, 1979), ANTLR (Parr, 2001),
COCO/R (Mössenböck, 1990) and JavaCC (Kodaganallur, 2004), which are
characterized by having the semantic actions defined on the concrete grammar,
as opposed to attribute grammar (AG) systems such as Eli (Gray, 1992) and
JastAdd (Hedin, 2003), which define the computations on abstract syntax trees
(ASTs) and usually require the complete input to be parsed before the
computations are started.

Fig. 1. Generator structure and languages involved

A parser generator translates a grammar specification into a parser and in- 
tegrates user-defined semantic actions into it. Since the user may make mistakes 
while writing the actions, this frequently leads to generating code that contains 
errors which are reported only by the implementation language compiler. We will 
now illustrate this problem with an example of a simple language of arithmetic 
expressions with variables. The ANTLR grammar for this language is shown in 
Listing 1 (no semantic actions are presented at this point).

// Lexical rules
fragment LETTER : 'a'..'z' | 'A'..'Z' | '_' ;
fragment DIGIT : '0'..'9' ;
VAR : LETTER (LETTER | DIGIT)* ;
INT : DIGIT+ ;

// Syntactic rules
expr : term (('+' | '-') term)* ;
term : factor ('*' factor)* ;
factor : VAR | INT | '(' expr ')' ;

Listing 1: An ANTLR grammar for arithmetic expressions

Let us consider the case when one uses ANTLR to develop a parser that
checks the syntax and evaluates the expressions in a given environment.
Evaluation of an expression is a special case of translation: the expression is
essentially translated into a number. An environment stores values for all variables
referenced in the expression. For example, if the parser is run on the input text
"x*(3+2)" in the environment [x=4], it accepts the input and returns 20. When
run on "(x+*3)", it does not accept (raises an exception), because the input is
not syntactically correct. Note that the notation in Listing 1 does not describe
environments: an environment is passed as a separate argument to the parser (see
the examples below).

To give an example of an error which may appear in the generated code, let 
us consider the following set of semantic actions for the rule factor: 

factor [Environment env] returns [int result]
    : VAR { result = env.getValue($VAR.getText()); }
    | INT { result = $INT; }
    | '(' e=expr[env] ')' { result = e; } ;

When we run ANTLR on a specification containing this rule it will successfully 
generate code. The code generated for the second alternative will contain the 
following lines: 

int result = 0;
// ...
Token INT2=null;
// ...
result = INT2;

The Java compiler yields an error message at the last line: a value of type Token
cannot be assigned to a variable of type int. Now we have to figure out that the
cause of this error is that we forgot to extract the contents of the token by calling
the getText() method on $INT in the specification. Let us correct this error:

    | INT { result = $INT.getText(); }

We run ANTLR again and in the generated code the erroneous line changes to 
the following: 

result = INT2.getText();

The Java compiler complains again at the same line: a String cannot be
assigned to an int variable. We have to correct the specification again:

    | INT { result = Integer.parseInt($INT.getText()); }

After generating the code again, we can see that it compiles successfully. 
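
The three attempts can be replayed in plain Java. The sketch below is only an illustration: SimpleToken is a hypothetical stand-in for the token type used by the generated parser, and FactorAction is a name invented for this example. The two rejected variants are kept as comments because the Java compiler refuses them.

```java
// Minimal stand-in for a lexer token (hypothetical; the token type used
// by an ANTLR-generated parser carries more information than this).
final class SimpleToken {
    private final String text;

    SimpleToken(String text) {
        this.text = text;
    }

    String getText() {
        return text;
    }
}

final class FactorAction {
    // Mirrors the corrected action:
    //     result = Integer.parseInt($INT.getText());
    static int intValue(SimpleToken intToken) {
        // int result = intToken;            // rejected: a token is not an int
        // int result = intToken.getText();  // rejected: a String is not an int
        return Integer.parseInt(intToken.getText());
    }
}
```

Each commented-out line corresponds to one failed compilation of the generated code described above.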

This process took us three complete runs of ANTLR, each followed by a
compilation attempt, and we had to analyze the generated code twice to figure
out which part of the specification caused the error. The development process
takes the form of the cycle shown in Figure 2 (left side). The generated code is
usually hard to read and, in the case of long specifications, code generation may
take up to several seconds. Thus, this cycle may be time-consuming.

The main motivation of the present study is to reduce the development cycle
to the one given in Figure 2 (right side). It corresponds to the usual way of
working with compilers: change-compile-errors-change. This paper aims at achieving
this goal by improving static checking of parser specifications.
A parser generator may be logically divided into two parts (see Figure 1):
a front-end which reads a specification, performs the static checks and builds
an internal representation of a parser, and a back-end which generates code. To
support the shortened development cycle, only the front-end should be allowed to
report errors to the user. The back-end must silently produce code which is
error-free whenever the front-end finds no errors in the specification.

Most generators have front-ends which check only for errors which prevent 
them from building the internal representation, such as usage of undefined names. 
This leads to the problem described above. 

In this paper we present an extensible specification language and a front-end 
infrastructure which detects type incompatibilities in assignments and function 
calls. In combination with name-screening and import statement generation in the 
back-end, this prevents all types of compiler errors for implementation languages 
like Java, C#, C or Pascal. 

Our approach is designed under the requirement of being applicable for 
many implementation languages, and thus the generic specification language may 
be extended to support a particular type system. We present an implementation of 
such a type system for Java and a generic language which allows one to specify 
simple type systems declaratively. 

We report on GRAMMATIC PG — a prototype implementation of our ap- 
proach built on top of ANTLR. GRAMMATIC PG specifications are type-checked 
and transformed into ANTLR specifications. The ANTLR tool-chain can then 
be used to generate the parser. 

The rest of the paper is organized as follows. Section 2 gives an overview of
our specification language, illustrating the language design and basic constructs.
The extensible type system of GRAMMATIC PG and a demonstration of how the
development cycle changes with it are presented in Section 3. This type system is
parameterized by the types of an implementation language and the corresponding
subtyping rules, and thus it must be instantiated to be actually used. In Section 4
we present a mechanism for declaratively describing type systems of
implementation languages and instantiating GRAMMATIC PG using them. This mechanism
describes the type systems as having a fixed number of types with subtyping rules
specified explicitly. To provide better integration with a particular
implementation language, one can write a custom extension. The extension mechanism and
an example extension supporting Java are described in Section 5. As we
mentioned before, the front-end provides an internal representation of a parser from
which a back-end can generate error-free code. We show how the latter is done
in our prototype in Section 6. Section 7 describes related work, and Section 8
summarizes our contribution and points out directions for future work.

2. Overview of the specification language. Attribute grammars (AGs),
introduced in (Knuth, 1968), are very convenient for describing translators, but in
the general case they require the whole input to be parsed before the translation
starts. This is why the most popular parser generators only use some concepts of
AGs, but not the whole framework.

The GRAMMATIC PG specification language corresponds to a restricted version
of AGs where each nonterminal symbol N is associated with a set of attributes,
which is divided into

• output attributes: computed by the productions defining N; 

• input attributes: computed by the productions which use N. 

Output attributes of GRAMMATIC PG directly correspond to synthesized at- 
tributes used in AGs, and input attributes represent a restricted case of inherited 
attributes. 

2.1. Translation functions and external functions. We use the terms
"input" and "output" for attributes because they correspond very closely to inputs
and outputs of functions. One can think of a syntactic rule defining a nonterminal
(which is a set of all productions for this nonterminal) as a "translation function".
The productions which use the nonterminal call this function, passing the input
attributes to it (as arguments). The function itself computes the output attributes
and returns them to be used by the caller. A recursive descent parser implements
this analogy in a one-to-one manner: for each nonterminal there is a
corresponding function which takes the input attributes as parameters and returns the
output attributes.
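
The function analogy can be made concrete with a hand-written recursive descent evaluator for the grammar of Listing 1. The following Java sketch is an illustration only and is not output of ANTLR or GRAMMATIC PG; the class name ExprEvaluator, the Map-based environment and the character-level scanning (instead of a separate lexer) are assumptions made for brevity.

```java
import java.util.Map;

// Each nonterminal becomes a method: the input attribute (env) is a
// parameter, the output attribute (result) is the return value.
final class ExprEvaluator {
    private final String input;
    private int pos = 0;

    private ExprEvaluator(String input) {
        this.input = input.replace(" ", "");
    }

    static int evaluate(String text, Map<String, Integer> env) {
        ExprEvaluator p = new ExprEvaluator(text);
        int result = p.expr(env);
        if (p.pos != p.input.length()) {
            throw new IllegalArgumentException("Unexpected input at " + p.pos);
        }
        return result;
    }

    // expr : term (('+' | '-') term)* ;
    private int expr(Map<String, Integer> env) {
        int result = term(env);
        while (peek() == '+' || peek() == '-') {
            char op = input.charAt(pos++);
            int t = term(env);
            result = (op == '+') ? result + t : result - t;
        }
        return result;
    }

    // term : factor ('*' factor)* ;
    private int term(Map<String, Integer> env) {
        int result = factor(env);
        while (peek() == '*') {
            pos++;
            result *= factor(env);
        }
        return result;
    }

    // factor : VAR | INT | '(' expr ')' ;
    private int factor(Map<String, Integer> env) {
        char c = peek();
        int start = pos;
        if (c == '(') {
            pos++;
            int result = expr(env);
            if (peek() != ')') {
                throw new IllegalArgumentException("Expected ')' at " + pos);
            }
            pos++;
            return result;
        }
        if (isDigit(c)) {                                  // INT : DIGIT+
            while (isDigit(peek())) pos++;
            return Integer.parseInt(input.substring(start, pos));
        }
        if (isLetter(c)) {                                 // VAR
            while (isLetter(peek()) || isDigit(peek())) pos++;
            Integer value = env.get(input.substring(start, pos));
            if (value == null) {
                throw new IllegalArgumentException("Undefined variable at " + start);
            }
            return value;
        }
        throw new IllegalArgumentException("Unexpected input at " + pos);
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    }

    // Returns the current character, or '\0' at the end of the input.
    private char peek() {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }
}
```

With the environment [x=4], ExprEvaluator.evaluate("x*(3+2)", env) returns 20, while the input "(x+*3)" is rejected with an exception, matching the behaviour described in Section 1.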

In principle, there may be many translation functions corresponding to the 
same syntactic rule. In GRAMMATIC PG notation, one specifies these functions 
right after the rule. A translation function is described by its signature (a name 
and lists of input and output attributes) and a body (this will be explained later), 
for example: 

N : ... ;                                // Syntactic rule

translateN(int in) --> (int out) {       // Signature
    // Body
}

Note that all attributes are declared with their types. 

Being a recursive descent parser generator, ANTLR also uses a function 
analogy for translation, although it does not support many translation functions 
for the same rule. In ANTLR notation the example given above looks as follows: 

N[int in] returns [int out] : ... ; 

Unlike general AGs (and like ANTLR), GRAMMATIC PG prescribes the order
of computations using translation schemes (Aho, 1986). In other words,
attributes are computed by semantic actions positioned somewhere inside
productions. The most popular notation for translation schemes (it is used by the
majority if not all parser generators, including ANTLR) is the following:

A -> B {action} C

This notation is rather intuitive, but in practice (e.g., in large ANTLR
grammars) it makes the specifications unreadable because it mixes context-free
productions and action code. To avoid this problem, in GRAMMATIC PG notation we
separate the grammar productions from the actions by using a technique similar
to AspectJ's advice (Kiczales, 2001). The above-mentioned production can be
written in our notation as follows:

A : B C;

translateA(...) --> (...) {
    after B : action;
}

The action is specified in the body of a translation function. The after key- 
word denotes that the action must be executed after the nonterminal B is matched. 
Another option to specify the same behaviour is to use the before keyword as fol- 
lows: 

before C : action ; 

The actions to the right of the ":" sign can assign values to attributes and call
external functions to perform computations. External functions must be
implemented outside the specification; in GRAMMATIC PG they are represented only
by their signatures, for example:

add(int x, int y) --> (int sum);

A developer must supply the implementation written separately (in the imple- 
mentation language). 

In addition to specifying which actions to execute before or after a certain 
position, one needs to say which translation function to call for a nonterminal on 
the right-hand side of the syntactic rule, what arguments to pass to it and where 
to store the returned values. For this purpose we use the at keyword followed by 
a name of the nonterminal: 

at C : tmp = translateC(a, b);

This, written inside a translation function for A, means that for the occurrence of 
C on the right-hand side of the syntactic rule for A we must call the translation 
function translateC, passing arguments a and b and writing the returned 
value to tmp. 

To illustrate some technical details of the GRAMMATIC PG notation, we will
now proceed to the example of arithmetic expressions mentioned in Section 1.

2.2. Example. The grammar for arithmetic expressions is given in Listing 1.
Our translation functions must evaluate an expression in a given environment
which contains values for the variables used in the expression. To do this, we will
need the appropriate external functions:

strToInt(String s) --> (Int value);
value(Environment env, String variable) --> (Int value);
zero() --> (Int zero);
one() --> (Int one);
neg(Int x) --> (Int negx);
add(Int x, Int y) --> (Int sum);
mul(Int x, Int y) --> (Int prod);

We do not fix an implementation language here: the example described in this 
section will work for any implementation language in which types can have 
names, and Int, String and Environment can be defined with the straight- 
forward meanings. 

We are going to specify three translation functions: one for each of the
nonterminals factor, term and expr. Each of these functions must have an input
attribute for the environment and an output attribute for the result. The translation
function for factor is shown in Listing 2.

factor : VAR | INT | '(' expr ')' ;

factor(Environment env) --> (Int result) {
    after VAR : result = value(env, VAR#);    // External function
    after INT : result = strToInt(INT#);      // External function
    at expr : result = expr(env);             // Translation function
}

Listing 2: Translation function for factor

The actions for VAR and INT use the notation "NAME#", which denotes
the textual value of a token (this value is of type String). Note that value and
strToInt, which are called after the tokens, are external functions, while expr
is a translation function called at the occurrence of the corresponding
nonterminal. An at position may contain nothing but a call to a translation function.
In some cases calls to translation functions (and thus whole at-actions) may be
omitted, but in our example this is not the case: we have to specify what argument
to pass to the translation function expr.

The rule for term differs from what we have shown so far in that it has two
occurrences of the same nonterminal factor. At both of these occurrences we need
to call the translation function factor defined above and pass the environment
object to it. The result will be assigned to a local attribute f, which is used
inside the translation function as auxiliary storage. This attribute f is declared
inside the translation function to have the type Int:

Int f;
at factor : f = factor(env);

This action is common to both occurrences of factor, but after the value is
acquired we must treat it differently in each case. To distinguish between
different occurrences of the same nonterminal, one can use the location labels
available in GRAMMATIC PG. With the use of such labels, the translation function for
term looks as follows:



term : $f1 = factor ('*' $f2 = factor)* ;

term(Environment env) --> (Int result) {
    Int f;                                   // Local attribute declaration
    at factor : f = factor(env);             // For both occurrences
    after $f1 : result = f;                  // Only for $f1
    after $f2 : result = mul(result, f);     // Only for $f2
}

Listing 3: Translation function for term
A label may be attached not only to a nonterminal occurrence, but also to
any phrase in a production. This feature is rather convenient when defining a
translation function for expr (see Listing 4). This function also gives an example
of a block which groups several statements together (see the action for $t1).

expr : $t1=term ($sgn=('+' | '-') term)* ;

expr(Environment env) --> (Int result) {
    at term : t = term(env);
    before $t1 : {
        result = zero();
        sign = one();
    }
    after term : result = add(result, mul(sign, t));
    before $sgn : sign = one();
    after '-' : sign = neg(sign);
}

Listing 4: Translation function for expr

To motivate our choice of notation again, we provide the same rule in
ANTLR notation (in Listing 5). As can be seen, it is rather hard to understand
the structure of the syntactic rule from it, which is never the case for
GRAMMATIC PG.

expr [Environment env] returns [int result]
    : {result = 0; sign = 1;}
      t=term[env] {result = t;}
      (
        {sign = 1;} ('+' | '-' {sign = -1;})
        t=term[env] {result += t * sign;}
      )*
    ;

Listing 5: ANTLR rule for expr



2.3. Multi-return functions. All functions used in our example have only
one output attribute, but in general a function may return a tuple. For example, a
function divide(x, y) may return two numbers: a quotient and a remainder.
GRAMMATIC PG does not support tuple-typed attributes, so the return value of
this function cannot be assigned to a single attribute. Instead, GRAMMATIC PG
supports attribute tuples. The result of the divide function can be received
in the following way:
Int quot;
Int rem;
(quot, rem) = divide(x, y);

This code assigns the first component of the returned tuple to the attribute quot
and the second component to the attribute rem.
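
In an implementation language without tuples, such as Java, an external function with several output attributes needs some carrier for its results. The sketch below shows one possible encoding. The class names DivideResult, ExternalFunctions and TupleAssignmentDemo are hypothetical, and nothing here is meant to describe the code actually emitted by the GRAMMATIC PG back-end.

```java
// Hypothetical carrier for the two output attributes of divide(x, y).
final class DivideResult {
    final int quot;
    final int rem;

    DivideResult(int quot, int rem) {
        this.quot = quot;
        this.rem = rem;
    }
}

final class ExternalFunctions {
    // divide(Int x, Int y) --> (Int quot, Int rem)
    static DivideResult divide(int x, int y) {
        return new DivideResult(x / y, x % y);
    }
}

final class TupleAssignmentDemo {
    public static void main(String[] args) {
        // Corresponds to the attribute tuple assignment
        //     (quot, rem) = divide(x, y);
        DivideResult r = ExternalFunctions.divide(17, 5);
        int quot = r.quot;
        int rem = r.rem;
        System.out.println(quot + " " + rem); // prints "3 2"
    }
}
```

The attribute tuple on the left-hand side is unpacked component by component, which is why no tuple-typed attribute is needed.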

2.4. Attribute initialization. Before the first assignment, the value of an
attribute is not defined and thus cannot be read. To ensure that every attribute is
initialized before its first use, GRAMMATIC PG performs conventional
data-flow analysis (Khedker, 2009). If the analysis finds a read access which may not
be preceded by a corresponding write access, the front-end reports an error. This
analysis relies on the construction of a control flow graph. GRAMMATIC PG does
not support conditional operators and loops as such, and all the branching and
repetition happens according to the structure of grammar rules, which is denoted
by the common regular operations: concatenation (sequence), alternative ("|")
and iteration ("+"). Optional constructs ("?" and "*") are viewed as alternatives
with an empty option.

A control flow graph is constructed as follows: concatenation corresponds
to sequential execution, alternative corresponds to branching and iteration
corresponds to a loop. Figure 3 shows a control flow graph for Listing 4; edges are
labeled with the corresponding sequences of attribute reads and writes, indicated as
"[r]" and "[w]" respectively.
Fig. 3. Control flow graph for expr with attribute access labels
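
A check of this kind can be sketched as a forward "must be assigned" analysis over a graph whose edges carry read and write labels. The Java code below is a generic textbook-style formulation under our own assumptions, not the algorithm implemented in GRAMMATIC PG: the class DefiniteAssignment, its nested types and the convention that node 0 is the entry node are all invented for this illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class DefiniteAssignment {
    // One attribute access on an edge: a read "[r]" or a write "[w]".
    static final class Access {
        final String attribute;
        final boolean isWrite;
        Access(String attribute, boolean isWrite) {
            this.attribute = attribute;
            this.isWrite = isWrite;
        }
    }

    static final class Edge {
        final int from;
        final int to;
        final List<Access> accesses;
        Edge(int from, int to, List<Access> accesses) {
            this.from = from;
            this.to = to;
            this.accesses = accesses;
        }
    }

    static Access read(String a) { return new Access(a, false); }
    static Access write(String a) { return new Access(a, true); }

    // Returns the attributes that may be read before being written.
    // Node 0 is the entry node; initiallyAssigned holds the input attributes.
    static Set<String> check(int nodeCount, List<Edge> edges, Set<String> initiallyAssigned) {
        // assigned.get(n) == null means "node n has not been reached yet".
        List<Set<String>> assigned = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++) assigned.add(null);
        assigned.set(0, new HashSet<>(initiallyAssigned));

        boolean changed = true;
        while (changed) {
            changed = false;
            for (Edge e : edges) {
                Set<String> in = assigned.get(e.from);
                if (in == null) continue;
                Set<String> out = new HashSet<>(in);
                for (Access a : e.accesses) if (a.isWrite) out.add(a.attribute);
                Set<String> old = assigned.get(e.to);
                if (old == null) {
                    assigned.set(e.to, out);
                    changed = true;
                } else if (old.retainAll(out)) {
                    // Intersection: assigned on every path reaching the node.
                    changed = true;
                }
            }
        }

        Set<String> errors = new TreeSet<>();
        for (Edge e : edges) {
            Set<String> in = assigned.get(e.from);
            if (in == null) continue; // edge is unreachable from the entry
            Set<String> current = new HashSet<>(in);
            for (Access a : e.accesses) {
                if (a.isWrite) current.add(a.attribute);
                else if (!current.contains(a.attribute)) errors.add(a.attribute);
            }
        }
        return errors;
    }
}
```

An alternative maps to two edges between the same pair of nodes, and an iteration maps to an edge that returns to an earlier node. An attribute written in only one branch of an alternative is therefore dropped by the intersection and reported when it is read afterwards.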



2.5. Local type inference. GRAMMATIC PG checks translation functions
for type safety: when a value is assigned to an attribute (passing an argument
can also be interpreted as an assignment to an input attribute of a function), the
type of the right-hand side must be a subtype of the type of the left-hand side.
This prevents GRAMMATIC PG from generating code with typing errors such as
those we discussed in Section 1.

In Listing 4, two attributes, env and result, are declared in the signature
of the translation function to have types Environment and Int respectively,
but the intermediate attribute t is not declared anywhere. What type does it have?
This is an example of local type inference, which makes GRAMMATIC PG feel
more like dynamic languages: if some attribute is used but not declared, the type
checker assumes that it is a local attribute and tries to figure out an appropriate
type for it considering the context in which it is used.

In our example, t is first assigned the value returned by the term function,
which has an output attribute result of type Int. We write this as follows:

    $t \leftarrow result : Int$    (1)

Then, it is passed to the mul function as a value for an input parameter y of type
Int:

    $y : Int \leftarrow t$    (2)

These two usages facilitate the following reasoning: assuming that there exists a
type $\tau$ for t such that the whole translation function is typed correctly,
from (1) we see that Int must be a subtype of $\tau$, and from (2) we see that
$\tau$ must be a subtype of Int. Hence, $\tau$ equals Int, and we have inferred
the type for t successfully.

The type checker in GRAMMATIC PG applies this kind of reasoning to every
attribute which is used but not declared. In some cases this procedure does not
lead to a definite conclusion. In these cases the type checker reports an error,
which can be resolved by providing an explicit declaration of the attribute in
question.

Not only attribute types but also signatures of external functions can be in- 
ferred in this manner. For example, assume the following assignment appears in 
a specification: 

a = f (b, c) 

If f is not declared, GRAMMATIC PG assumes that there is an external function f 
with one output attribute and two input attributes, and applies the above reasoning 
to these attributes. If it succeeds, a complete type for f is inferred. 

We provide a more detailed description of type checking and type inference 
in GRAMMATIC PG in the next section. 

3. Type system. This section describes the extensible type system used in 
GRAMMATIC PG . The typing rules presented below are written under the assump- 
tion that all the attributes and external functions are declared explicitly. The pur- 
pose of type inference in this case is to reconstruct omitted annotations. If the 
reconstruction is not possible, the specification is considered to be inconsistent. 
3.1. Typing rules. Let the implementation language be denoted by L. The set of
types of the implementation language will be denoted $\mathcal{G}_L$, and its
elements will be referred to as ground types. Let $Str \in \mathcal{G}_L$ be the
ground type which represents character strings. The types used in GRAMMATIC PG
are defined by the following productions:

AttributeType ::= $\mathcal{G}_L$
TupleType ::= (AttributeType*)
FunctionType ::= TupleType -> TupleType

As the names suggest, attributes may only have ground types, tuples are se- 
quences of attributes and have corresponding types, and functions send tuples to 
tuples. As we explained above, attribute tuples are used to receive return values 
of functions having more than one output attribute. Note that tuples can not be 
nested. Function arguments and individual return attributes can only have ground 
types: for example, a single argument can not be a tuple. 

A subtyping relation on the set $\mathcal{G}_L$ of ground types (denoted
$\leq_L$) represents the subtyping rules of the implementation language. We assume
that it is reflexive and transitive. We will use $\tau \geq_L \sigma$ and
$\sigma \leq_L \tau$ interchangeably.

Each translation function is type-checked separately. A type-checking
context $\Gamma$ comprises signatures of all functions available in the
specification along with declarations of all input and output attributes of these
functions and local attributes of the function being checked. Since attributes in
different signatures may have the same names, each attribute is indexed with the
name of the function it belongs to. For example, a context may contain the
following declarations:

factor : (Environment) -> (Int)
env_factor : Environment
result_factor : Int

term : (Environment) -> (Int)
env_term : Environment
result_term : Int

Figure 4 provides straightforward typing rules for token values, attributes
and tuples, and a rule for function application which says that the type of an
argument must be a subtype of the type of the corresponding formal parameter.

In Figure 5, CorrectStatement denotes all the statements in which typing
rules are respected.

The only nontrivial constraint is expressed by the rule ASSIGNMENT: the
type of the right-hand side must be a subtype of the type of the left-hand side. As
the left-hand side may be a tuple (as may the right-hand side in the case
of function application), we use an extended subtyping relation $\geq^*_L$, which
is the minimal relation such that $\geq_L \subseteq \geq^*_L$ and

$(a_1, \ldots, a_n) \geq^*_L (b_1, \ldots, b_n)$ iff $(a_1 \geq_L b_1) \wedge \ldots \wedge (a_n \geq_L b_n)$
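
The extended relation amounts to a componentwise check. A minimal Java sketch is given below, assuming that types are identified by name and that the ground subtyping relation is supplied as a predicate; the class TupleSubtyping and this representation are our own choices for illustration.

```java
import java.util.List;
import java.util.function.BiPredicate;

final class TupleSubtyping {
    // Returns true iff the tuple type sub is below the tuple type sup in the
    // extended relation: both have the same length and every component of
    // sub is a ground subtype of the matching component of sup.
    // A single ground type is modelled as a tuple of length one.
    // groundSubtype.test(a, b) is assumed to mean "a is a subtype of b".
    static boolean isSubtype(List<String> sub, List<String> sup,
                             BiPredicate<String, String> groundSubtype) {
        if (sub.size() != sup.size()) {
            return false;
        }
        for (int i = 0; i < sub.size(); i++) {
            if (!groundSubtype.test(sub.get(i), sup.get(i))) {
                return false;
            }
        }
        return true;
    }
}
```

For instance, if Integer is a ground subtype of Object, then (Integer, Integer) is below (Object, Integer), whereas tuples of different lengths are never related.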

3.2. Type inference. The previous subsection formalizes the type system
of GRAMMATIC PG under the assumption that every attribute and every function
which is used is also declared explicitly. As we illustrated above, such
declarations may be redundant in some cases, and GRAMMATIC PG (following many
programming languages) provides a local type inference mechanism which allows
some of them to be omitted.

Type inference works separately in each translation function. It represents
all the statements as sequences of attribute assignments (function arguments are
treated as "assigned" to input attributes), assigning unknown types to undeclared
attributes. To reconstruct the declarations we use a modification of a conventional
algorithm (see, for example, (Pierce, 2002)) which creates constraints
(subtyping inequalities) for unknown types and finds ground types which satisfy these
constraints (see Section 2.5 for a simplistic example). Our modifications to the
classical algorithm are not significant enough to warrant a formal presentation of
the entire type inference process. Instead, we will only illustrate the behaviour
of our algorithm in case of an ambiguity.

Let us assume that the implementation language has the following types:
$\mathcal{G}_L = \{Object, Integer\}$, where $Integer \leq_L Object$. Consider
the following function:

f(Integer x) --> (Object result) {
    before ... : t = x;
    // ...
    after ... : result = t;
}


The local attribute t is not declared. Let $\tau$ be the unknown type for t. The
type inference algorithm will construct the following set of constraints:

$Integer \leq_L \tau$
$\tau \leq_L Object$

These constraints have at least two solutions: both Object and Integer might 
be assigned as a type for t. The very existence of a solution already means that 
the specification does not have type inconsistencies, but the back-end will need 
the exact type information to generate code, so we have to decide which type to 
choose. This may be important when inferring the types for external functions 
since they will be visible to the user (we provide details on the back-end below). 

In such cases GRAMMATIC PG prefers lower bounds to upper bounds, which
means that t will be assigned the type Integer. The general procedure is the
following: find a minimal solution satisfying all lower bounds; if it also satisfies
all upper bounds, choose it as the final solution. If there are several
(incomparable) minimal solutions for the lower bounds which satisfy all the upper
bounds, the algorithm cannot decide between them and reports an error. If no
solution for the lower bounds satisfies all the upper bounds, the specification is
inconsistent, and we again report an error. If no lower bounds are present, we
choose a maximal solution for the upper bounds.

This procedure can be summarized as follows: we look for the type which is 
as close to the constraining ones as possible, preferring more concrete (smaller) 
types. This approach appears to be rather intuitive: it is very unlikely to infer a 
type which the developer does not expect. 
If no constraints are present at all, this means that none of the attributes
connected to the one at hand is declared. In this case we ask whether the ground
type system has a top type $Top_L$ (such as java.lang.Object); if it has one, we
choose it, otherwise we report an error.
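
The selection procedure described in this subsection can be sketched for a finite set of ground types as follows. This is an illustrative reading of the prose above rather than the code used in GRAMMATIC PG. The class BoundSolver is hypothetical, the subtyping predicate is assumed to be a reflexive, transitive and antisymmetric relation, and reporting an error when several maximal solutions exist for the upper bounds is our assumption, since the text does not cover that case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.BiPredicate;

final class BoundSolver {
    // Chooses a ground type for an undeclared attribute.
    // sub.test(a, b) must mean "a is a subtype of b".
    // top may be null when the ground type system has no top type.
    static String solve(Set<String> groundTypes, BiPredicate<String, String> sub,
                        List<String> lowerBounds, List<String> upperBounds, String top) {
        if (lowerBounds.isEmpty() && upperBounds.isEmpty()) {
            if (top == null) {
                throw new IllegalStateException("no constraints and no top type");
            }
            return top;
        }
        // Lower bounds are preferred; upper bounds drive the search only
        // when no lower bound is present.
        boolean useLower = !lowerBounds.isEmpty();
        List<String> candidates = new ArrayList<>();
        for (String t : groundTypes) {
            boolean ok = true;
            if (useLower) {
                for (String lo : lowerBounds) ok &= sub.test(lo, t);
            } else {
                for (String up : upperBounds) ok &= sub.test(t, up);
            }
            if (ok) candidates.add(t);
        }
        // Keep minimal candidates for lower bounds, maximal ones otherwise.
        List<String> extremal = new ArrayList<>();
        for (String c : candidates) {
            boolean isExtremal = true;
            for (String other : candidates) {
                if (other.equals(c)) continue;
                boolean otherIsBetter = useLower ? sub.test(other, c) : sub.test(c, other);
                if (otherIsBetter) isExtremal = false;
            }
            if (isExtremal) extremal.add(c);
        }
        // The chosen type must satisfy all upper bounds as well.
        List<String> solutions = new ArrayList<>();
        for (String c : extremal) {
            boolean ok = true;
            for (String up : upperBounds) ok &= sub.test(c, up);
            if (ok) solutions.add(c);
        }
        if (solutions.isEmpty()) {
            throw new IllegalStateException("inconsistent constraints");
        }
        if (solutions.size() > 1) {
            throw new IllegalStateException("ambiguous: " + solutions);
        }
        return solutions.get(0);
    }
}
```

For the function f above, the lower bound Integer and the upper bound Object yield Integer. Checking only the minimal candidates against the upper bounds is sufficient: by transitivity, if a minimal solution of the lower bounds violates an upper bound, every type above it violates that bound too.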

3.3. Revisiting the initial example. Now we are ready to explain how the
example from Section 1 is handled by GRAMMATIC PG. In that example we had
a rule for factor analogous to the one in Listing 2, in which there was an error:
the name of the INT token was used instead of its textual value (INT#). Here is
a modification of Listing 2 containing the same defect:

factor : VAR | INT | '(' expr ')' ;

factor(Environment env) --> (Int result) {
    after VAR : result = evaluate(env, VAR#);
    after INT : result = INT;
    at expr : result = expr(env);
}

Unlike ANTLR, GRAMMATIC PG reports the following error when we try to
generate code: "The local attribute INT might have not been initialized". What
happened? INT appears on the right-hand side of an assignment. Thus,
GRAMMATIC PG expects it to be an attribute. Since such an attribute is not declared, the
type checker treats it as a local attribute and infers a type for it: Int. At this
point, no error has been found. After the type checking, the definite assignment
analysis is performed, and it finds that the local attribute INT is read but never
assigned. This leads to the error which is reported.

Without generating the code and running a compiler, we have got an error
message which points precisely to the place in the specification where the
defect is situated. Following the logic of the example of Section 1, we correct the
specification:

after INT : result = INT#; 

Now the type checker complains: "Incompatible types: String and Int". Again, 
we have got a precise error message without generating code and running a com- 
piler. We correct the specification again: 

after INT : result = strToInt(INT#);

This time all checks are passed successfully, the code is generated and will be 
compiled with no errors. 

As can be seen, the development cycle now takes the form shown in Figure 2
(right side), which was our main goal. Now we proceed to a description of the
tools GRAMMATIC PG provides to support many implementation languages.
4. Declarative descriptions of type systems. The type checking procedure
described above is parameterized by a set of ground types $\mathcal{G}_L$ with
distinguished types $String_L$ and $Top_L$, and a subtyping relation $\leq_L$. To
incorporate a type system of the implementation language into GRAMMATIC PG, one
needs to substitute concrete implementations for these parameters. In general,
this is done by adding front-end extensions (plug-ins written in Java), which
provide support for particular implementation languages. Developing such
extensions requires some effort and may be undesirable in certain cases. For this
reason GRAMMATIC PG provides a default extension which supports declarative
descriptions of type systems of implementation languages.

A type system description specifies a set of named types, a subtyping relation 
on these types, an optional top type and a string type. For example, the type 
system used above may be described as follows: 

typesystem Simple (
    _,      // Name of the top type (nothing in this case)
    String  // Name of the string type
) {
    // Type declarations
    type Int;
    type Environment;
    type Object;

    // Subtyping rules
    Environment <: Object;
    String <: Object;
}

This denotes a type system named Simple with no top type (an underscore
denotes this; if there were a top type, its name would be written instead)
and with the string type called String. It declares three more types: Int,
Environment and Object. The first two were used above; the third one
is added only for demonstration purposes, namely to introduce subtyping rules
which state that Environment and String are subtypes of Object. Note
that since Object is not the top type, Int is not its subtype. In general,
the subtyping rules stated in a type system description form only a part of the
subtyping relation. The final subtyping relation is obtained as their
reflexive-transitive closure.
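
The closure step can be sketched as a reachability query over the declared rules. The code below is an illustration under our own assumptions (types are represented by their names, and the class DeclaredSubtyping is hypothetical); it does not handle a top type and is not taken from the tool.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: subtyping as the reflexive-transitive closure of declared rules.
final class DeclaredSubtyping {
    // Maps each type name to its directly declared supertypes.
    private final Map<String, Set<String>> direct = new HashMap<>();

    // Records a declared rule "sub <: sup".
    void declare(String sub, String sup) {
        direct.computeIfAbsent(sub, k -> new HashSet<>()).add(sup);
    }

    // True if sup is reachable from sub by following zero or more declared rules.
    boolean isSubtypeOf(String sub, String sup) {
        Deque<String> work = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        work.push(sub);
        while (!work.isEmpty()) {
            String current = work.pop();
            if (current.equals(sup)) {
                return true; // zero steps gives reflexivity, more steps transitivity
            }
            if (seen.add(current)) {
                for (String next : direct.getOrDefault(current, Set.of())) {
                    work.push(next);
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        DeclaredSubtyping simple = new DeclaredSubtyping();
        simple.declare("Environment", "Object");
        simple.declare("String", "Object");
        System.out.println(simple.isSubtypeOf("Environment", "Object")); // true
        System.out.println(simple.isSubtypeOf("Int", "Object"));         // false
        System.out.println(simple.isSubtypeOf("Int", "Int"));            // true
    }
}
```

With the rules of the Simple type system this reproduces the observation made above: Int is a subtype of itself but not of Object.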

Type system descriptions such as the one shown above are sufficient to check
GRAMMATIC PG specifications for type safety and to infer types, but they are not
sufficient to generate code unless the implementation language has types with
exactly the same names as used in the description. The latter is unlikely because
types are normally defined inside some kind of namespace, such as a Java
package, and cannot be referred to by a simple name without import statements.
Thus, we have to provide another description which instantiates the types with
their actual form in the implementation language. We call this a language
description:

language Java for Simple {
    Int = 'int';
    Environment = 'java.util.Map<String, Integer>';
    String = 'String';
    Object = 'Object';
}

The example shows a language description named Java which instantiates the
type system Simple and says that Int is implemented as int in Java, and
Environment is implemented as a map from strings to integer objects. We
could have omitted the instantiations for String and Object, since these
types may be referred to just by their names.
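
One way to picture how a back-end might consume such a description is a lookup table from abstract type names to target-language text. The sketch below is hypothetical: the class and method names are ours, and the fallback to the abstract name for missing entries is an assumption based on the remark about String and Object above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a language description as a name-to-text table.
final class LanguageDescriptionSketch {
    private final Map<String, String> instantiations = new LinkedHashMap<>();

    // Records "abstractType = 'targetText';" from a language description.
    LanguageDescriptionSketch instantiate(String abstractType, String targetText) {
        instantiations.put(abstractType, targetText);
        return this;
    }

    // Assumption: a type without an entry is emitted under its own name.
    String render(String abstractType) {
        return instantiations.getOrDefault(abstractType, abstractType);
    }

    public static void main(String[] args) {
        LanguageDescriptionSketch java = new LanguageDescriptionSketch()
                .instantiate("Int", "int")
                .instantiate("Environment", "java.util.Map<String, Integer>");
        System.out.println(java.render("Int"));
        System.out.println(java.render("Environment"));
        System.out.println(java.render("String"));
    }
}
```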

Now a back-end can use the strings we have provided in generated code, 
and it will work correctly (if the back-end generates Java and not C or some 
other language). A back-end may need some extra information such as a package 
to put the generated code in and a name to give to the generated parser. These 
options are provided in a back-end profile, such as 

backend 'org.grammatic.pg.backends.ANTLRJavaBackend' for Java {
    package = 'org.example.arithexp';
    parserName = 'ExpressionEvaluator';
}

This is a profile for a back-end implemented by the Java class
ANTLRJavaBackend, which applies to the language description Java
defined above. In the profile one simply writes name-value pairs which are
processed by the back-end as options.

To summarize, the declarative descriptions of type systems in
GRAMMATIC PG are organized into three levels (see Figure 6).



Fig. 6. Three levels of type system descriptions (diagram labels: Parser
Specification, Type System, Yacc, ANTLR, CUP)

This makes GRAMMATIC PG rather flexible when it comes to multi-targeted
parser specifications, from which parsers in many implementation languages
must be generated. In such a case one describes an abstract type system as shown
above and provides many language descriptions for it, so that each back-end
profile may use its own language. In some situations it is also convenient to have
several back-end profiles for the same language, for example, when one needs to
compare the performance of different implementations or while migrating from one
back-end to another.

5. Language-specific front-end extensions. To provide tighter integration
with a particular implementation language one can use a language-specific
extension of the GRAMMATIC PG front-end instead of the default one described
in the previous section. Technically, a front-end extension consists of a ground
type syntax specification and implementations of Java interfaces which capture
the semantic aspects of ground types: a subtyping relation, a set of predefined
types and two distinguished types. Let us describe these parts in more detail.
The examples below present an extension supporting Java types (including generics),
which we developed in our prototype.

5.1. Syntax of ground types. The core specification of the
GRAMMATIC PG notation (see Listing 6) has extension points: it uses but does not define
two nonterminal symbols, type and declarations (written in bold font in the
listing). The language generated by type is a syntactical form of G_L. Since
the specification parser in GRAMMATIC PG is itself implemented using
GRAMMATIC PG, type must be defined by grammar rules and translation functions.
This is done in a separate specification file which is virtually "appended" to the
generic specification when the whole system is built.

specification
    : declarations?
      (externalFunctionSignature | (grammarRule translationFunction*))* ;
attributeDeclaration
    : type NAME ;

Listing 6: Extension points in GRAMMATIC PG notation

A syntactic function for type must return an instance of
java.lang.Object (GRAMMATIC PG is implemented in Java); in other
words, a type may be represented by an arbitrary object. The context-free rules
for types in Java 5 are given in Listing 7. For the sake of brevity we do not
include the corresponding translation functions.

type
    : IDENTIFIER typeArguments? ('.' IDENTIFIER typeArguments?)* ('[' ']')*
    | basicType ;
typeArguments
    : '<' typeArgument (',' typeArgument)* '>' ;
typeArgument
    : type
    | '?' (('extends' | 'super') type)? ;
basicType
    : 'byte' | 'short' | 'char' | 'int' | 'long' | 'float' | 'double'
    | 'boolean' ;

Listing 7: Ground type syntax for Java 5

With the rules from Listing 7 used for defining the syntax of ground types, a
signature of the evaluate function may be the following:

evaluate
    (java.util.Map<java.lang.String, java.lang.Integer> environment)
    -->
    (java.lang.Integer result);

As can be seen, fully qualified class and interface names and generics can be used
as types for attributes.

5.2. Declarations. The names in the example above are quite long, which
is inconvenient. In Java this problem is solved with the help of imports.
GRAMMATIC PG does not know about the structure of Java types and cannot support
imports itself. Instead, it provides a generic mechanism for adding arbitrary
declarations which are specific to an implementation language. The syntax for
declarations is defined by the declarations nonterminal. A translation function
for it does not take or return any attributes: it is supposed to collect the
information about the declarations and store it internally, so that it is available
when some other component (e.g., the translation function for types) needs it.

To support imports we can define declarations as follows: 

declarations
    : importDeclaration* ;
importDeclaration
    : 'import' IDENTIFIER ('.' IDENTIFIER)* ('.' '*')? ;

declarations
    : options? importDeclaration* ;
options
    : '#javaoptions' '{' option+ '}' ;
option
    : NAME '=' STRING ';' ;
importDeclaration
    : 'import' IDENTIFIER ('.' IDENTIFIER)* ('.' '*')? ;

Listing 8: Declaration syntax for Java 5



public interface ISubtypingRelation<T> {
    boolean isSubtypeOf(T type, T supertype);
}

public interface ITypeSystem<T> {
    ISubtypingRelation<T> getSubtypingRelation();
    Set<T> getPredefinedTypes();
    T getTopType();
    T getStringType();
}

Listing 9: Java interfaces describing semantics of ground types



The corresponding translation functions will collect the information about
imported types and provide it to the translation function for types. Now we can
use Java imports in GRAMMATIC PG specifications, for example

import java.util.Map;

evaluate (Map<String, Integer> environment) --> (Integer result);
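
The information collected from import declarations can be pictured as a table from simple names to qualified names. The following Java sketch is only an illustration: the class ImportTableSketch is hypothetical, wildcard imports are left out, and the fallback to java.lang for names without an import is our simplifying assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of resolving simple type names through single-type imports.
final class ImportTableSketch {
    // Maps a simple name such as "Map" to a qualified name such as "java.util.Map".
    private final Map<String, String> imported = new HashMap<>();

    // Records a single-type import, e.g. "java.util.Map".
    void addImport(String qualifiedName) {
        int lastDot = qualifiedName.lastIndexOf('.');
        imported.put(qualifiedName.substring(lastDot + 1), qualifiedName);
    }

    // Resolves a name written in a specification to a qualified name.
    String resolve(String name) {
        if (name.indexOf('.') >= 0) {
            return name; // already qualified
        }
        String qualified = imported.get(name);
        if (qualified != null) {
            return qualified;
        }
        return "java.lang." + name; // assumption: unimported names live in java.lang
    }

    public static void main(String[] args) {
        ImportTableSketch imports = new ImportTableSketch();
        imports.addImport("java.util.Map");
        System.out.println(imports.resolve("Map"));
        System.out.println(imports.resolve("Integer"));
    }
}
```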

The complete syntax of declarations which we use for Java is given in
Listing 8. In addition to imports, it supports options which are used by the back-end
and specify auxiliary information (this corresponds to back-end profiles of the
default extension).

5.3. Semantics of ground types. Semantics of ground types is provided by
Java classes that must implement the interfaces shown in Listing 9.

A subtyping relation is represented by a class which implements the
ISubtypingRelation<T> interface, where the type parameter must be
substituted by the class which the extension uses to represent types internally.
The isSubtypeOf method returns true if and only if the following
condition holds:

type ≤_L supertype

For example, our implementation represents Java types using the
EGenericType abstraction from the Eclipse Modeling Framework
(Budinsky, 2003). In this case the subtyping relation class is declared as
follows:

public class JavaSubtypingRelation
        implements ISubtypingRelation<EGenericType>

The other interface, ITypeSystem, has methods which return a subtyping
relation represented as discussed above, a set of predefined types, a top type (or
null if no top type exists in the ground type system) and a type for character
strings. A back-end must use the toString() method of type objects to obtain
their textual representation.
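
To make the contract of these interfaces more tangible, the sketch below implements them for a deliberately simple type representation, java.lang.Class<?>, using the standard Class.isAssignableFrom method. This is not the EGenericType-based implementation mentioned above; the class names ReflectiveSubtyping and ReflectiveTypeSystem are hypothetical, and the interfaces are repeated without the public modifier only to keep the fragment self-contained.

```java
import java.util.Set;

// The two interfaces from Listing 9, repeated here for self-containment.
interface ISubtypingRelation<T> {
    boolean isSubtypeOf(T type, T supertype);
}

interface ITypeSystem<T> {
    ISubtypingRelation<T> getSubtypingRelation();
    Set<T> getPredefinedTypes();
    T getTopType();
    T getStringType();
}

// Hypothetical sketch: subtyping between reference types decided by reflection.
final class ReflectiveSubtyping implements ISubtypingRelation<Class<?>> {
    @Override
    public boolean isSubtypeOf(Class<?> type, Class<?> supertype) {
        // True when a value of 'type' can be assigned to a variable of 'supertype'.
        return supertype.isAssignableFrom(type);
    }
}

// Hypothetical sketch: a type system restricted to non-generic reference types.
final class ReflectiveTypeSystem implements ITypeSystem<Class<?>> {
    private final ISubtypingRelation<Class<?>> subtyping = new ReflectiveSubtyping();

    @Override
    public ISubtypingRelation<Class<?>> getSubtypingRelation() {
        return subtyping;
    }

    @Override
    public Set<Class<?>> getPredefinedTypes() {
        return Set.of(Integer.class, String.class, Object.class);
    }

    @Override
    public Class<?> getTopType() {
        return Object.class; // every reference type is assignable to Object
    }

    @Override
    public Class<?> getStringType() {
        return String.class;
    }
}
```

Such a representation is too weak for a real extension: Class<?> cannot express generic instantiations, and its toString() result is not valid source syntax, which is one reason to use a dedicated type class instead.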

6. Back-end. By now we have presented an extensible front-end which can
detect typing errors in specifications. Here we will explain why detecting these
errors is sufficient for the back-end to be able to generate error-free code. The
techniques described below apply to virtually any language, and we believe that
whenever the front-end can be extended to support a particular implementation
language, a corresponding back-end can be developed¹.

Generating error-free code from a grammar with no semantic actions is
relatively easy. The problems arise when we need to incorporate hand-written code
fragments into the generated program. The GRAMMATIC PG front-end guarantees that
the actions do not contain errors themselves, and thus errors may be caused
only by conflicts between hand-written and generated code. For example,
semantic actions may introduce names which are already used in the generated code or
require particular imports.

The distinctive property of errors of this sort is that the back-end can always
detect and prevent them while reading the internal representation of the
specification. This is because the back-end has total control over the generated code.
For example, to avoid name clashes, it is sufficient to rename variables, which
the back-end can do.

¹ Currently a back-end must be programmed manually in Java. This is because the issues
described in this section are rather specific to each implementation language, and for the moment
we do not see an efficient way to abstract them into a reusable framework. Such a framework is an
interesting direction for future research.

In the case when the back-end generates ANTLR specifications with semantic
actions in Java, we have to prevent the following types of errors:

• A name used in the specification is a Java or ANTLR keyword.

• A name is used internally by ANTLR.

• A type requires some classes or interfaces to be imported.

Other types of errors are either prevented by the analysis performed by the front- 
end (e.g., usage of uninitialized variables) or do not appear because of the partic- 
ular structure of the code generated by ANTLR. 

Naming problems are easy to prevent since it is sufficient to use a fresh name,
which can be obtained by adding numbers to original names (e.g., result1).
The code remains readable enough and no errors appear. The import problem is
also fixed straightforwardly: we can always import all the classes ever mentioned
in the specification or use fully qualified names if some short names clash.
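
The renaming scheme can be sketched in a few lines of Java. The class FreshNames and the particular reserved names below are illustrative assumptions, not the code of our back-end.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of generating fresh names by appending numbers.
final class FreshNames {
    // Names that must not be produced: keywords, internal names, names already handed out.
    private final Set<String> taken = new HashSet<>();

    FreshNames(Set<String> reserved) {
        taken.addAll(reserved);
    }

    // Returns 'name' if it is free, otherwise name1, name2, ...; the result becomes taken.
    String fresh(String name) {
        String candidate = name;
        for (int i = 1; taken.contains(candidate); i++) {
            candidate = name + i;
        }
        taken.add(candidate);
        return candidate;
    }

    public static void main(String[] args) {
        FreshNames names = new FreshNames(Set.of("int", "class", "result"));
        System.out.println(names.fresh("result")); // result1
        System.out.println(names.fresh("result")); // result2
        System.out.println(names.fresh("env"));    // env
    }
}
```

A back-end using such a helper would also have to apply each renaming consistently to every occurrence of the original name in the semantic actions.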

Correctness of external function signatures is guaranteed by the following
design of generated parsers. Along with an ANTLR specification,
GRAMMATIC PG generates a Java interface which has a method for each external
function used in the specification. This interface must be implemented in order to
supply the functions, which makes the Java compiler check that the functions are
declared properly in the implementation. Because this interface is read and
implemented by a human, we select the most appropriate types during the type
inference process (see the section describing type inference).
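
For the strToInt function used in the earlier example, the generated interface and a hand-written implementation might look as follows. The names ExpressionEvaluatorFunctions and DefaultFunctions are hypothetical, and the exact shape of the code produced by the back-end is an assumption; the sketch only shows how the Java compiler ends up checking the signature.

```java
// Hypothetical shape of an interface generated for the external functions
// of a specification; one method per external function.
interface ExpressionEvaluatorFunctions {
    // String -> Int in the specification, with Int instantiated as 'int'.
    int strToInt(String text);
}

// Hand-written implementation; the compiler rejects it if a signature does not match.
final class DefaultFunctions implements ExpressionEvaluatorFunctions {
    @Override
    public int strToInt(String text) {
        return Integer.parseInt(text);
    }
}
```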

To summarize, we have demonstrated how a back-end can prevent all the 
errors which are imposed by the structure of the generated code (and thus not 
present in the specification, and not checked by a front-end). These techniques 
vary over implementation languages, but the essence stays unchanged: we can 
always generate code in such a way that these errors do not appear. Thus, we 
have reached our goal: as soon as a specification is successfully type-checked, 
the generated code is error-free. 

7. Related work. We are not aware of any parser generators which support 
multiple implementation languages and have type checking in their front-ends. 

The most popular tools supporting multiple implementation languages are
ANTLR (Parr, 2007) and COCO/R (Mossenbock, 1990), and they do not
perform any type checking in their front-ends. In most cases these tools cannot
generate code in different implementation languages from the same specification (the
specifications, thus, are not multi-targeted in these systems), because the
embedded actions are written in a particular implementation language and will not
compile in another one. In SableCC (Gagnon, 1998) specifications are
multi-targeted, which is achieved by having no semantic actions: a developer has to
manually process parse trees using visitors. In contrast, GRAMMATIC PG
supports both multi-targeted specifications (using type system descriptions) and
semantic actions.

The following attribute grammar systems are capable of reducing the
development cycle to the one shown in Figure 2 (right side). Eli (Gray, 1992)
automatically tracks compiler errors back to the specification. This approach is tied
not only to a specific implementation language (Eli uses C), but also to a
specific implementation of its compiler, since the format of error messages usually
varies from one compiler to another. The team behind the JastAdd (Hedin, 2003)
system plans to integrate their own implementation of a Java compiler into it, to
check semantic actions. This approach is also tied to a specific implementation
language. Neither of these systems provides appropriate means by which it
could be extended to other implementation languages. GRAMMATIC PG allows
for this via front-end extensions and separate back-ends.



8. Conclusion. This paper addresses the problem of type checking in
front-ends of parser generators supporting multiple implementation languages. The
main goal is to prevent typing errors in the generated code, to avoid the need to
manually trace such errors back to their causes in the specification.

We have demonstrated that type checking of the specifications, which we
implemented in a prototype tool GRAMMATIC PG, helps to reduce the development
cycle compared to the one imposed by the tools currently available (see Figure 2).
Our approach is designed to be extensible for use with multiple implementation
languages.

The principal contributions of this paper are the following:

• A GRAMMATIC PG specification language supporting 

- semantic actions, but having no problem of tangling between grammar 
rules and action code; 

- extensions to support type systems of implementation languages. 

• A type checking procedure for this language, supporting local type 
inference, compatible with the extensions. 

• A generic extension for declarative definitions of abstract type systems,
their syntactical realizations for particular languages and configuration
profiles for different back-ends, which together enable multi-targeted
specifications.

• Another extension providing tight integration with Java. 

We have also reported on a prototype back-end for generating ANTLR/Java,
which, we believe, never produces erroneous code from successfully
type-checked specifications.

One possible direction to continue this work is to investigate the possibility
of declaratively specifying back-ends for particular implementation languages, to
obtain a complete generator from declarative specifications. Another direction
is to integrate grammar inspections (such as heuristic ambiguity tests) into
the static checking procedure.